In this notebook we'll play around with a pre-trained word model to look at its vocabulary and to try out some of the basic operations commonly performed on word vectors.
We'll start by using the Python package gensim which implements all of the basic features we need like loading the model, accessing its vocabulary, and performing similarity lookups. Immediately after, though, we'll go "under the covers" and perform the same operations manually so you can see what's really going on.
Along with the original word2vec papers, the authors released a large Word2Vec model that they trained on roughly 100 billion words from a Google News dataset. It contains exactly 3 million words, and the word vectors have 300 features each. There are newer, presumably better, pre-trained models, but this is the original.
Download the model file here (it's 3.39GB) and save it into a subdirectory: ./data/GoogleNews-vectors-negative300.bin
Once you've downloaded the model file, we'll use a helper function from gensim to load it as a KeyedVectors object that provides a lot of convenience functions.
import gensim
filepath = './data/GoogleNews-vectors-negative300.bin'
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=True)
I usually start by poking around the vocabulary of the model to get a feel for it. Let's print some random vocab words in two columns.
import random
# Retrieve the list of words in the vocabulary as 'vocab'.
# (Note: in gensim 4.x and later this attribute was renamed;
# use list(model.key_to_index) instead.)
vocab = list(model.vocab.keys())
# Print 20 random words in two columns.
for i in range(10):
    # Choose and print two random words.
    print('%30s %30s' % (random.choice(vocab),
                         random.choice(vocab)))
Certainly a lot of nonsense in there! I've found this to be fairly typical of pre-trained word models.
Let's look more explicitly at some different word types.
I'm going to define a little helper function which takes a list of words and reports which of them are found in the vocabulary, checking both the original and lower-case forms.
def check_vocab(vocab, try_words):
    print("%30s %s" % ('Word', 'Included'))
    print("%30s %s" % ('====', '========'))
    for word in try_words:
        print("%30s %s" % (word, str(word in model.vocab)))
        # If the word isn't already lower case, try lower case as well.
        if word.lower() != word:
            print("%30s %s" % (word.lower(), str(word.lower() in model.vocab)))
Stop words
check_vocab(vocab, ['a', 'and', 'the'])
Multi-word names
check_vocab(vocab, ['Abraham_Lincoln',
                    'Michael_Jordan',
                    'Tom_Brady',
                    'Elon_Musk',
                    'United_States',
                    'United_States_of_America',
                    ])
Multi-word topics
check_vocab(vocab, ['Computer_Science',
                    'Global_Warming',
                    'Foreign_Policy',
                    ])
Idioms
check_vocab(vocab, ['couch_potato',
                    'dime_a_dozen',
                    'hit_the_sack',
                    'cut_corners',
                    ])
Misspellings (these are all misspelled)
check_vocab(vocab, ['accomodate',
                    'begining',
                    'concious',
                    'incidently',
                    'recomendations',
                    ])
Punctuation
check_vocab(vocab, ['man`s',
                    'man\'s',
                    'it`s',
                    'it\'s',
                    'U.S.A.',
                    ])
Numbers
check_vocab(vocab, ['1',
                    'one',
                    '12',
                    'twelve',
                    '100',
                    'one_hundred',
                    ])
Let's take a look inside a single vector, for the word "couch".
From the output we can see that the values appear to fall between -1.0 and 1.0, and that the vector is dense rather than sparse (no features are zero).
%matplotlib inline
import seaborn as sns
import numpy as np
# Let's peek at the word vector for the word 'couch'.
vec = model.word_vec('couch')
# Shape and sample values:
print("Vector shape: " + str(vec.shape))
print("Sample values: <%.4f, %.4f, %.4f, ..., %.4f, %.4f>" %
      (vec[0], vec[1], vec[2], vec[-2], vec[-1]))
# What's the vector's magnitude?
print("Norm: %.2f" % np.linalg.norm(vec))
# Are some values zero? How many?
num_zero = len(vec) - np.count_nonzero(vec)
print("Number of zeros: %d of %d" % (num_zero, len(vec)))
# Plot a histogram of the feature values to visualize their
# distribution.
# (Note: distplot is deprecated in recent seaborn releases;
# sns.histplot plus sns.rugplot is the modern equivalent.)
ax = sns.distplot(vec, kde=False, rug=True)
t = ax.set_title('Histogram of Feature Values')
gensim includes convenience functions for computing a number of common word similarity operations.
The model object has convenience functions for comparing two vectors. The code below shows that "couch" and "book" have a low similarity, while "couch" and "sofa" are very similar--as we would hope.
# Let's try comparing some specific words.
# First, how similar are "couch" and "book"?
score = model.similarity('couch', 'book')
print("Cosine similarity between 'couch' and 'book' is %.2f" % score)
# How about "couch" and "sofa"?
score = model.similarity('couch', 'sofa')
print("Cosine similarity between 'couch' and 'sofa' is %.2f\n" % score)
We can also find the most similar words in the vocabulary to "couch".
The results look pretty sensible!
# What are the 10 most similar words to "couch" in the vocabulary?
results = model.most_similar(positive='couch', topn=10)
# Print out the results.
print("10 most similar words to 'couch':")
print("%20s %s" % ('word', 'score'))
for (word, score) in results:
print("%20s %.2f" % (word, score))
Fun side note: What's with the "al Jabouri slept" result? I believe this is an artifact of the Google News dataset. This particular phrase comes from this story, which mentions an officer named al-Jabouri sleeping on a couch! A quirk of the Google News dataset is that news outlets everywhere pick up wire articles (from Reuters, I think?) and modify them only slightly, so the corpus contains many near-duplicate news articles.
It's great that gensim makes these operations easy for us, but to make sure we have a firm grasp on how they work, let's implement the above vector operations from scratch--just for educational purposes.
(Side Note: If you want to keep playing with gensim a little more, there's nice documentation for the KeyedVectors class here).
Let's start by pulling the word vectors matrix out of the model.
# Note - Older versions of gensim stored the vectors in `model.syn0`
vecs = model.vectors
print('Word vector matrix shape: ' + str(vecs.shape))
Now let's pull out our specific word vectors manually by looking up their index from the vocabulary.
import numpy as np
# Let's look up the index of the vector for 'couch' and 'sofa'.
w1 = model.vocab['couch'].index
w2 = model.vocab['sofa'].index
# Select the vectors using their row index.
v1 = vecs[w1, :]
v2 = vecs[w2, :]
# Let's check out the norms of these two vectors.
print('Norm for "couch": %.2f' % np.linalg.norm(v1))
print('Norm for "sofa": %.2f' % np.linalg.norm(v2))
Here's the formula for the cosine similarity of two vectors 'x' and 'y'.
$ cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||} $
The formula is the dot product of the vectors divided by the product of their magnitudes. However, we can change the order of operations: normalize the vectors first, then take their dot product. It's better to think of it in this order--we'll see why in a bit.
# Normalize our vectors:
v1_norm = v1 / np.linalg.norm(v1)
v2_norm = v2 / np.linalg.norm(v2)
# Let's double check the result:
print('New norm of "couch": %.2f' % np.linalg.norm(v1_norm))
print('New norm of "sofa": %.2f' % np.linalg.norm(v2_norm))
# Now we can take the dot-product of the normalized vectors:
cos_sim = np.dot(v1_norm, v2_norm)
print('\nCosine similarity between "couch" and "sofa": %.2f' % cos_sim)
# Also show the gensim results as a sanity-check
print(' (gensim: %.2f)' % model.similarity('couch', 'sofa'))
Now let's try searching the vocabulary for the top-10 most similar words to "couch".
We could iterate over all 3M words in the vocabulary, calculating the cosine similarity as we did above, and then sorting them. This is brutally slow, though! As you may know, it's much more efficient to perform vector-matrix operations. We'll multiply the vector against the whole matrix, and the processor will be able to use efficient linear algebra routines and SIMD instructions to speed up this heavy compute task.
Here is where it's going to help us to change up the order of operations. If we simply normalize the entire word vector matrix as a pre-processing step, then we never have to worry about the normalization step again!
We start by calculating the norms for all 3M vectors.
%%time
print("Calculating vector norms, this can be slow...\n")
# First, numpy can calculate the norms of all of our vectors.
# We specify axis=1 so that one norm is computed per row
# (i.e., per word vector).
norms = np.linalg.norm(vecs, axis=1)
Now we divide the vectors by their norms.
%%time
print("Shape of vecs: " + str(vecs.shape))
print("Shape of norms: " + str(norms.shape))
# Add a second dimension to norms, so that it's 3M x 1.
norms = norms.reshape(len(norms), 1)
print("\nNormalizing all vectors, this can be slow...")
# Vecs is [3M x 300] and norms is [3M x 1]. Performing division
# will result in each row of 'vecs' being divided by the scalar
# in the corresponding row of 'norms'.
vecs_norm = vecs / norms
# Sanity check...
print("\nNew norm of first vector: %.2f \n" %
      (np.linalg.norm(vecs_norm[0, :])))
Now that we have the normalized vectors, we can calculate the cosine similarities for 'couch' and all 3M vocabulary words!
%%time
# Look up the index of the vector for 'couch'.
w_i = model.vocab['couch'].index
# Select the *normalized* vector using the row index.
v_norm = vecs_norm[w_i, :]
# For our matrix-vector multiplication, we need v_norm
# as [300 x 1]
v_norm = v_norm.reshape(len(v_norm), 1)
print('Calculating all word similarities...\n')
# Perform the matrix-vector multiplication.
# vecs_norm * v_norm = all_sims
# [3M x 300] * [300 x 1] = [3M x 1]
all_sims = vecs_norm.dot(v_norm)
# Remove the extra dimension from the similarity values.
all_sims = all_sims.flatten()
The final step is simply to sort the results and display them.
%%time
# The gensim class contains a list of all words in the vocabulary.
# We'll need this list in order to map back from row indices to
# their words.
vocab_words = model.index2word
# Turn the similarities vector into a list of tuples in the form
# (index, similarity)
# e.g.,
# [(0, 0.03), (1, 0.20), (2, 0.08), ...]
results = enumerate(all_sims)
print("Sorting similarities...\n")
# Now sort the list of tuples by the similarity value.
# Sort descending, with highest similarity first.
results = sorted(results, key=lambda x: x[1], reverse=True)
print("Top 10 most similar words to 'couch':")
# Display the top 10 results and their similarities, skipping
# result 0, which is 'couch' itself (similarity 1.0).
for i in range(1, 11):
    # Get the word index for result 'i'.
    word_index = results[i][0]
    # Look up the word.
    word = vocab_words[word_index]
    # Print the word and its similarity value.
    print('%20s %.2f' % (word, results[i][1]))
print('')
A common misconception about cosine similarity is that it is always positive, but this is not the case for vectors that contain negative feature values (such as word vectors). The cosine of 180 degrees is -1, so two vectors pointing in opposite directions have a cosine similarity of -1.
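To make that concrete, here's a tiny sketch using a made-up toy vector (not one from the model) confirming that a vector and its negation have a cosine similarity of exactly -1:

```python
import numpy as np

def cosine_sim(x, y):
    # Dot product divided by the product of the magnitudes.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([0.5, -0.25, 1.0])

print("cos(a,  a) = %.2f" % cosine_sim(a, a))   # same direction -> 1.00
print("cos(a, -a) = %.2f" % cosine_sim(a, -a))  # opposite direction -> -1.00
```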
Just for fun, what does the model think are the least similar words to "couch"?
print("Top 10 least similar words to 'couch':")
# Display the last 10 results and their similarity values.
# (Python lets us iterate backwards through a list with
# negative indices.)
for i in range(-1, -11, -1):
    # Get the word index for result 'i'.
    word_index = results[i][0]
    # Look up the word.
    word = vocab_words[word_index]
    # Print the word and its similarity value.
    print('%20s %.2f' % (word, results[i][1]))
print('')
I don't know about you, but when I think about the opposite of "couch" the first thing that comes to mind is definitely the Basmati Growers Association! :)
Another "just for fun" exercise: here's a faster way to sort the results.
NumPy's argsort function returns just the sorted indices (in ascending order), which is all we need.
%%time
# Sort the similarities, but return the sorted *indices*.
results2 = np.argsort(all_sims, axis=0)
# For the top 10 results, iterating in reverse (argsort sorts
# ascending) and skipping index -1, which is 'couch' itself...
for i in range(-2, -12, -1):
    # Get the word index for result 'i'.
    word_index = results2[i]
    # Look up the word.
    word = vocab_words[word_index]
    # Look up the calculated similarity value.
    sim = all_sims[word_index]
    # Print the word and its similarity value.
    print('%20s %.2f' % (word, sim))
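One last aside (an extra optimization, not part of the walkthrough above): for a top-k lookup, even a full argsort is overkill. NumPy's argpartition selects the k largest entries in linear time, and then we only need to sort those k. Here's a sketch using made-up similarity scores rather than the real model output:

```python
import numpy as np

def top_k_indices(sims, k):
    # Partition so that the k largest similarities land in the last
    # k positions (an O(n) operation), then fully sort just those k
    # entries (O(k log k)) and reverse for descending order.
    top = np.argpartition(sims, -k)[-k:]
    return top[np.argsort(sims[top])[::-1]]

# Toy similarity scores (not real model output).
sims = np.array([0.1, 0.9, 0.4, 0.8, 0.05])
print(top_k_indices(sims, 3))  # indices of the three highest scores
```

For a 3M-word vocabulary this avoids sorting millions of entries we'll never display.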